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We experiment with a massively parallel implementation of an algorithm for simulating the 
dynamics of metastable decay in kinetic Ising models. The parallel scheme is directly applicable 
to a wide range of stochastic cellular automata where the discrete events (updates) are Poisson 
' arrivals. For high performance, we utilize a continuous-time, asynchronous parallel version of the 

7i- fold way rejection-free algorithm. Each processing element carries an Ixl block of spins, and we 
employ the fast SHMEM-library routines on the Cray T3E distributed-memory parallel architecture. 
' Different processing elements have different local simulated times. To ensure causality, the algorithm 

handles the asynchrony in a conservative fashion. Despite relatively low utilization and an intricate 
relationship between the average time increment and the size of the spin blocks, we find that for 
sufficiently large I the algorithm outperforms its corresponding parallel Metropolis (non-rejection- 
free) counterpart. As an example application, we present results for metastable decay in a model 
ferromagnetic or ferroelectric film, observed with a probe of area smaller than the total system. 



I. INTRODUCTION 
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\ Fast and efficient algorithms are invaluable ingredients for large-scale simulations in physical sciences and engi- 
c"j . neering. The implementation of efficient, massively parallel algorithms for Monte Carlo simulations is not only an 
interesting and challenging problem, but also one of the most complex ones in parallel computing. It belongs to 
the class of parallel discrete-event simulation (sometimes referred to as distributed simulation) which has numerous 
applications in engineering, computer science, and economics, as well as in physics Jp. For example, in lattice Ising 
t-H , models the discrete events are spin updates, while in queueing networks they are job arrivals. The dynamics of 
^ ■ these systems, which obviously contain a substantial amount of parallelism, were traditionally simulated on serial 
computers. Paradoxically, it is fairly difficult in practice to implement an efficient parallel algorithm to simulate these 
dynamics, mainly due to the fact that the discrete events (updates) are not synchronized by a global clock. Here we 
present and analyze the performance of a variation Q of the n-fold way algorithm |3|,^| to simulate magnetization 
switching in a kinetic Ising model on a distributed-memory parallel computer. 

Metastability and hysteresis are widespread phenomena in nature. Ferromagnets are common systems that exhibit 
Q\ , these behaviors B, but there are also numerous other examples ranging from ferroelectrics Q and electrochemical 
1 adsorbate layers 0] to liquid-crystals j^] . An important potential technological application of nanoscale ferromagnetic 
particles and ultrathin films is high-density magnetic recording media, and computational experiments and modeling 
g ■ of these materials are integral parts of current research and engineering. 

Simulating metastable decay often involves long characteristic time scales (the lifetime of the metastable phase), 
and several sophisticated algorithms have been developed for serial computers d]4||| to tackle this problem. Common 
testbeds for these algorithms are kinetic Ising ferromagnets below their critical temperature T c , which exhibit slow 
metastable decay after reversal of the external magnetic fiel d |To[ . These models are appropriate for the study of 
highly anisotropic single-domain nanoparticles and thin films |llf|. 

There are powerful techniques, such as multi-spin coding [fl2| and cluster algorithms |13 14j] , to simulate the equilib- 



rium properties of the Ising model, but these methods completely change the original microscopic dynamics. Kinetic 
Ising models, either with integer-time updates or with Glauber's continuous-time interpretation Jl5|, were believed 
to be inherently serial, i.e., the corresponding algorithm could attempt to update only one spin at a time. Providing 
a counter example to that belief, Lubachevsky presented an efficient conservative approach for parallel simulation of 
these systems [|2| without changing the dynamics of the underlying model. Applications of his scheme also include 
modeling of wireless cellular communication networks Jl6[ and ballistic particle deposition flr7| . Also, he proposed a 
way to incorporate the rt-fold way algorithm, possibly further contributing to speedup. Here, we implement this al- 
gorithm on the isotropic, two-dimensional Ising model, and we systematically compare its performance to the parallel 
Metropolis algorithm. To our knowledge, this is the first actual implementation of the parallel n-fold way scheme. 
More importantly, detailed comparison between the two parallel schemes has not been given before. As we show in 



1 



this paper, there is an interesting competition between their performances, and detailed analysis is essential to decide 
which one to apply in different parameter regimes. 

This paper is organized as follows. In Section II we define the model and summarize the standard Metropolis and 
rejection-free serial algorithms. In Section III we outline the basic conservative approach for parallel discrete-event 
simulation, applied to Ising spins, and describe the parallel Metropolis and n-fold way algorithms. In Section IV we 
give some details of the implementation and analyze its performance. In particular, we compare it to that of the 
parallel Metropolis update scheme. In Section V we give some examples of ongoing and future applications to model 
magnetic and ferroelectric systems, and in Section VI we conclude with a brief summary. 

II. THE MODEL 

We simulate magnetization switching in the isotropic Ising model onaLxL square lattice with periodic boundary 
conditions, which has the following Hamiltonian: 

L 2 

n = -j^2s iSj -Hj2 Si . (i) 

(ij) i=l 

Here J > is the ferromagnetic nearest-neighbor spin-spin interaction, H is the external field, the sum Yluj) runs over 
all nearest-neighbor bonds, and runs over all L 2 lattice sites. To study metastable decay, all spins are initialized 
in the +1 state, and we apply a negative magnetic field (i.e., a sudden field reversal) at constant T < T c . Here T c 
is the critical temperature of the zero-field Ising model, below which the system spontaneously orders. For T<T C 
the decay of the metastable phase proceeds through nucleation and growth of one or more compact droplets of the 
stable phase. For fixed T and H there exists a system size, approximately the typical droplet separation R (T, \H\), 
such that for L>R a , many droplets form and contribute to the decay of the metastable phase. This is called the 
multidroplet regime |]lof . For the simulations presented in this paper the parameters are chosen such that the system 
is in this regime. In particular, i? «275 at T=0.6T C and |if|/J=0.2222, and i? «12 at T=0.8T C and 1771/7=0.4127 
are the largest and smallest values of R a for the temperatures and fields used in the following sections. 

A. The serial Metropolis algorithm 

In the standard serial Monte Carlo (MC) algorithm Jl5|] where time is a discrete variable taking on integer values, 
we choose the single- spin- flip Metropolis rates, according to which at every MC step (MCS) a spin is picked randomly 
on the lattice and flipped with the probability 

p = min{l,exp(-A7Y/fcT)} , (2) 

where AH is the energy change which would result from the flip. 

It is important to note that in this standard algorithm, the choice of fixed integer time increments (i.e., At = 1 
MCS between two successive spin-flip trials) is a convenience rather than a necessity: the underlying dynamics of the 
real physical system corresponds to a continuous time evolution, in which spin flips are Poisson arrivals. Thus, the 
time increment between two successive events is an exponentially distributed random variable. To exactly mimic the 
Poisson arrivals one should generate random time increments by 

At = - ln(r) (3) 

in units of MCS, where r is uniformly distributed in (0, 1). Clearly, this is computationally more expensive than the 
simple integer-time update and usually yields identical results when averaged over many independent runs. This extra 
cost, however, can pay off when mapping the system onto a parallel computer, as we shall see in Sections III and IV. 

An important quantity of interest is the average lifetime (r) of the system, which is the time needed to exit the 
metastable phase. A possible way to estimate this quantity is to keep track of the time series of the magnetization 

1 L2 

m = s i . (4) 

i=l 

which approximately equals —1 in the equilibrium phase. An average of its first-passage time to zero (i.e., 
m(t = t) = 0) over many independent escapes from the metastable state will then yield (r). 
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The weakness of the standard Metropolis algorithm (with both integer and continuous time) is the low acceptance 
rate of spin- flip trials at low temperature and low field, which is a result of the small flipping probability p in Eq. (|^). 
For example, at T=0.7T C and |i?|/J=0.2857 the fraction of successful spin-flip attempts is only about 5%. One may 
overcome this "waste of trials" by using a rejection-free update scheme as summarized in the following subsection. 



B. The serial n-fold way algorithm 



In the the n-fold way update scheme |3j, a spin flip is always performed, and the simulated time is incremented 
appropriately. To implement the scheme, one must introduce the notion of spin classes which carry the state of the 
spin itself and its neighbors. In the above model there are ten such classes, characterized by the number of spins in 
class i, rii, and the flipping probability, pi, which is the same for all spins in a class. Since the classes are disjoint, 
Si=i n i = L 2 . When performing an update, a class is first chosen according to the relative weights {riiPi}|£ 1 , then 
one of the spins in the class is picked with equal probability, 1/rij. Once the information on the classes has been 
updated, in particular the rtj's, the time of the next update is determined. As in the standard (non-rejection- free) 
update scheme, the n-fold way algorithm can be performed in either integer or continuous time Q . In both cases the 
time increment is a random variable. For the integer-time case, it is a geometrically distributed random number, 



At = INT 



ln(r) 



ln(l-r) 

while for continuous time it is exponentially distributed, 



At: 



ln(r) 



(5) 



(6) 



Here. 



r = 



Li=l n iPi 

L 2 



(7) 



is the inverse of the average time needed to exit the configuration specified by {n^}, and r is a uniformly distributed 
random number on (0, 1) For both cases, At is given in units of MCS. At low temperatures the p^s of the 

dominant classes (the ones with high populations) can be very small, resulting in large typical time increments. For 
example, at T=0.7T C and |-ff |/J=0.2857 the mean time increment is approximately 20 MCS. This is how the algorithm 
"bypasses" a large number of unsuccessful flip attempts and can increase the efficiency by several orders of magnitude 



Finally, it is important to note that both the standard Metropolis and the n-fold way algorithms yield identical 
physical results. They are simply two different implementations for simulating the same dynamics. 



III. PARALLELIZATION 



A. The parallel Metropolis algorithm 

First we review the conservative scheme for parallelization of the standard Metropolis algorithm An obvious 
way to parallelize the serial Metropolis algorithm is to spatially decompose the LxL lattice into Ixl blocks. When 
mapping it onto a parallel computer, each processing element (PE) carries an Ixl block of spins, and the number of 
PEs is Ape = (L/l) 2 (Fig. [I]). Here we outline the algorithmic steps for the continuous-time case. Each PE carries 
its own local time t. Initially a spin configuration, corresponding to t = 0, is chosen, and the time of the first update 
is determined by t = — ln(r), independently on each PE. In our case, the initial configuration is s.j=+l for all spins. 
For clarity, our time unit will be one MCS per PE (MCSP), during which each PE attempts to update one spin on 
average. Each PE is responsible for updating the N = I 2 spins that it carries by iterating the following steps: 

1. Select a spin from the block with equal probabilities. 

2. (a) If the chosen spin is in the kernel of the block, go to Step 3. 

(b) If the chosen spin is on the boundary of the block, wait until the local simulated time t of the next update 
becomes less than or equal to the same quantity for the corresponding neighboring PE(s) (for our model 
on a square lattice, 2 PEs for corner spins, one PE otherwise), then proceed to Step 3. 
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3. Update the state of the spin using the same probabilities as in the standard serial algorithm [Eq. 




4. Determine the time of the new next update by using the local time increment At = — ln(r), where r is a uniformly 
distributed random number on (0, 1). 

With reliable random number generators (an independent one on each PE), and using a continuous probability 
density (exponential) for the time increments, the probability of equal-time nearest-neighbor updates is of measure 
zero. Thus, no global barrier synchronization is necessary among the PEs to preserve the uniqueness of the simulated 
trajectory, provided the same set of random seeds are used initially. The algorithm is obviously free from deadlock, 
since in the worst-case scenario the PE with the minimum local time is able to make progress. It is clear from the 
above asynchronous algorithm that at any given (wall clock) moment, different PEs have different local simulated 
times. The "wait until" directive in Step 2., however, ensures that the information passed between neighboring PEs 
does not violate causality. This can be seen from the following example. Suppose that PE4 is currently updating spin 
Si, which is on the boundary of its block (Fig. Q). This update is possible if the local update time on PE4, £4, is not 
larger than that on PE5, £5. The update is also necessarily correct since Sk is guaranteed to be in the same state that 
it was in at £4. This is so because updates for PE5 have been forbidden by the "wait until" directive once t§ became 
larger than £4. 

The above asynchronous algorithm is suitable for a continuous-time update scheme, but it can cause inconsistency 
when integer time is used. Then, explicit barrier synchronization should be incorporated (synchronous algorithm) to 
treat equal-time spin-flip trials of nearest-neighbor spins without ambiguity, and to ensure the reproducibility of a 
simulated path, provided the same set of random seeds is used. However, these (independent) nearest-neighbor, equal- 
time events violate detailed balance [with respect to the Hamiltonian (Q)], and are unprecedented in the corresponding 
serial algorithm. Thus, we conclude that to be faithful to the original kinetic Ising model, one must employ the 
continuous-time update scheme. Other, more general cellular automata may tolerate nearest-neighbor, equal-time 
updates, since in such cases the microscopic update rates are not necessarily related to an a priori known equilibrium 
probability distribution, and the corresponding equal-time events may very well represent the real physical behavior. 
Also note that frequent barrier synchronization can be very "expensive" on a distributed-memory architecture, such 
as the T3E, where each PE executes the code completely independently of all other PEs. This is especially true for 
Monte Carlo algorithms, in which the core of the update routine is extremely simple and the time a PE spends at a 
barrier waiting for the others can easily exceed the time it is active. 



B. The "shielded" parallel rt-fold way algorithm 

A natural idea to parallelize the n-fold way algorithm is to employ the same spatial decomposition as for the 
parallel Metropolis algorithm, and to apply the serial rejection- free update scheme directly on each block. However, 
one cannot simply run a copy of the serial n-fold way algorithm on each PE: due to the nearest-neighbor interaction, 
the local time increments and the class populations depend not only on the spin values in the block carried by that 
PE, but also on the states of the spins located on the adjacent boundaries of the neighboring PEs. By updating a 
spin on the boundary of a block, the updating PE could corrupt the simulated history of a neighboring block, if the 
local time of the latter is already ahead of that of the former. Thus, the neighboring PE would need to perform a 
rollback procedure to recover from the simulation of a series of "false" events. It is easy to see that this rollback might 
generate a cascade of rollbacks in other PEs, making the implementation rather difficult. Moreover, it is unclear how 
frequently such rollbacks would occur, and how far they would go back in time and propagate through space. 

Lubachevsky proposed a way to avoid rollbacks by modifying the original algorithm 01. In each block, an additional 
class is defined which contains the spins on the boundary. The weight of this class is the number of boundary spins, 
N\, = 4(/ — 1), which does not change during the simulation. The original n-fold way tabulation of spin classes is only 
used in the kernel of the block. Hence, Ab + X)i=i n i ~ where Yli=i n i ~ is the number of spins in the kernel, 
and N = I 2 is the total number of spins in a block. Note that the population of the classes, {At,, is maintained 

separately for each block by the corresponding PE. Once the initial configuration is set and the corresponding local 
time of first update is determined, the asynchronous algorithm (with continuous time) consists of the following iterated 
steps performed by each PE: 

1. Select a class according to the relative weights {Ab, {n%Pi}\^-\}i an d select a spin from the chosen class with 
equal probabilities. 

2. (a) If the chosen spin is in the kernel, flip it with probability one and go to Step 3. 

(b) If the chosen spin belongs to the boundary class, wait until the local simulated time of the next update 
becomes less than or equal to the same quantity for the corresponding neighboring PE(s). Then the state 
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of this spin may or may not change: flip it with the usual Metropolis probability [Eq. (g)] and proceed to 
Step 3. 

3. Update the tabulation of the spin classes {^j}i=i m the kernel. 

4. Determine the time of the new next update (in units of MCSP) by using the local time increment, 
At = — ln(r)/r s , where 

N h + Ej£i riiPi m 

and r is a uniformly distributed random number on (0, 1). 

Thanks to the introduction of the class of boundary spins (which has fixed weight) , the neighboring PEs are shielded 
from each other, since a spin flip on the boundary of a block does not effect the tabulation of spin classes in an adjacent 
one. Hence, just like in the standard parallel scheme, the same conservative approach (the "wait until" directive in 
Step 2.) ensures that the information passed between PEs is valid and the updates are correct. 

Again, (employing proper synchronization) one can experiment with integer-time updates. Although in this case 
time increments are random integers and the probability of picking two nearest-neighbor spins residing on two adjacent 
PEs with equal update times is small, it is nevertheless nonzero. We have already argued that these (perhaps rare) 
events have no corresponding analogues in the serial algorithm. Furthermore, the use of barrier synchronization 
severely degrades performance. Although, for "experimentation" purposes on the architecture we implemented the 
integer-time, synchronous algorithm as well [H, we will not discuss its performance in detail here. 



IV. PERFORMANCE 



We implemented both the parallel Metropolis and the parallel ?i-fold way algorithms (as described in the previous 
section) on the Cray T3E parallel architecture at the National Energy Research Scientific Computing Center (NERSC). 
For message passing, we employ the Cray-specific, logically shared, distributed memory access (SHMEM) routines. 
The fast SHMEM library supports communication initiated by one PE, together with remote atomic memory opera- 
tions. Without these features, it would be extremely inconvenient to code an algorithm for stochastic simulation on a 
distributed memory machine, where the pattern of communication is completely unpredictable. These characteristics 
clearly outweigh the loss of portability of our code. 

To better understand the performance of our implementation we monitored the following quantities: 

Utilization. This is defined as the fraction of "non-idling" PEs, i.e., the ones which either pick a spin in the kernel 
or successfully pass the "wait until" directive in Step 2. of the algorithms. Since the routines are asynchronous (the 
main simulation cycles on each PE are not executed in lock-step) it is fairly difficult to measure this ratio. To obtain 
an estimate, we placed explicit barrier synchronization in our code and performed separate runs. The performance 
was irrelevant for these runs, the only information that we aimed for was the fraction of non-idling PEs in each (now 
artificially lock-step) main simulation cycle. We emphasize that the utilization only measures the fraction of PEs 
that are not idle. It does not say anything about whether the active PEs are doing anything "useful" in the sense of 
performing successful updates. 

The mean local time increment, At. Although At in the ?i-fold way algorithm is not stationary over the course 
of the metastable decay, its mean carries important information about how successfully the algorithm "bypasses" 
those spin-flip attempts wh ich w ould be rejected if the Metropolis algorithm were used. For the parallel Metropolis 
algorithm At — 1 trivially (ln(r) = 1). 

PE update rate. This quantity is literally the simulation speed of a PE in units of standard MCS per PE per 
second, i.e., MCSP/s. This is an "absolute" measure and in fact tells which parallel algorithm should be used for 
optimal simulation speed. The full update rate of the parallel algorithm is simply (PE update rate)xiVpE- 

In order to compare directly the performances of the parallel algorithms to those of their serial counterparts, we 
also ran the corresponding serial routines on one node of the T3E and determined the following measures: 

Efficiency. This is the ratio of the PE update rate of the parallel algorithm to the update rate of the corresponding 
serial algorithm using the same full system size L. For both parallel algorithms it is proportional to the utilization, 
and it is related to the communication speed of the architecture through the fraction of spin-flip attempts in which 
message passing is needed. For our algorithms, due to the fast communication hardware of the architecture, the 
communication overhead is a small effect compared to the inherent low utilization, which is a common drawback of 
conservative parallel methods. For the parallel n-fold way algorithm the efficiency is also proportional to the ratio of 
the typical time increments of the parallel and the corresponding serial routine, Ai par /A< scr . 
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Speedup. This is the ratio of the (full) update rate of the parallel algorithm to the update rate of its corresponding 
serial counterpart, i.e., efficiency xA^pe- 

Both efficiency and speedup are relative measures, and they merely indicate how successfully the parallelization of 
the corresponding serial algorithm is accomplished. As we shall see, a parallel code with lower efficiency can outperform 
one with higher efficiency, due to the faster "serial" core of the former. For very large systems, direct simulation using 
the serial algorithms is not possible, due to memory limits. The largest system sizes we could allocate were L=2048 
for the serial Metropolis and L=1280 for the serial n-fold way algorithm. To obtain speedup and efficiency estimates 
for larger systems we extrapolated the smaller-system update rates of the serial routines. 

Before discussing the performance of the "shielded" n-fold way algorithm, we note two inherently weak features, 
which are not related to the otherwise fast communication hardware of the T3E parallel architecture. 

First, as a general guideline, the fewer communications one has to execute, the better is the performance of the 
parallel code. In our case, the probability of picking a spin on the boundary, which will be followed by some kind 
of communication between neighboring PEs, is greater than the surface-to-volume ratio, N^/N = 4(1 — l)/l 2 ~ 4/1. 
In particular, it is determined by the relative weights in the modified n-fold way algorithm, which are given by 
N h /{N b + J2i=in iPi ). With very small pis this ratio can take unfavorably large values, close to unity, leading to 
more frequent message passing and, more importantly, idling if required by the "wait until" directive. 

Second, the typical time increment of the parallel algorithm is smaller than for the serial one. As mentioned in 
Section II, the advantage of the serial n-fold way routine lies in the large typical time increments that correspond 
to the very small flipping probabilities at low temperatures. However, for the same sets of class-specific flipping 
probabilities pi, the mean time increments of the "shielded" parallel and serial n-fold way routines are related as 



_1 EZi^ N k _N h 1 JV k __l N h _1 

Atpar ~ N N k N~N At scl . N Ai ser N At sc / ' 

This follows from Eq. (^) and implies that Ai par < Ai scr always. Thus at very low temperatures, where 
Aiscr 3> i/4, Aip ar is determined almost completely by the block size, rather than by the pi's and the populations of 
the corresponding classes: 

At par £ 1/4 . (10) 

We test the scaling of the parallel algorithms up to 400 PEs. First, the system size is kept constant, and we divide 
it into smaller and smaller blocks (Fig. |^). Then, alternatively, we keep the block size fixed and study the scaling 
for larger and larger systems by increasing the number of blocks (Fig. |3|) . We also carry out experiments with fixed 
number of PEs and varying block size (Fig. ||) to study in detail the effect of increasing surface-to- volume ratio of the 
blocks. Each of these studies were performed at T=Q.7T C and \H\J J=0.2857, where the typical droplet separation, 
R , is approximately 32 lattice constants. Finally, with fixed number of PEs and fixed block size we study the effect 
of the temperature and magnetic field on the performance (Fig. ||), to determine the regime of efficient applications 
to our particular model system. The results reflect the features discussed in the previous paragraph. 

A. Scaling with iVpE for fixed system size. 

For this series of runs we choose L=512 for the total system size and we employ Ape=4, 16, 64, and 256 (correspond- 
ing to Z=256, 128,64, and 32, respectively). For both the parallel n-fold way and the parallel Metropolis algorithm 
the utilization drops with decreasing block size (Fig. |2|a). The utilization for the n-fold way routine is significantly 
lower than for Metropolis: the probability of choosing a spin on the boundary is greater than the surface-to- volume 
ratio and then it is ultimately followed by an inquiry of the local time of the neighbor(s) and possible idling. Further, 
the typical time increments are decreasing as well (Fig. ||b). Note that for the serial n-fold way routine the mean 
time increment is Ai ser = 19.9 (Table ||). This is the only parameter in Eq. (^|), which gives complete agreement with 
the measured values of Ai par , as shown by the solid curve in Fig. ||b. As a result of these factors, the PE update rate 
of the parallel n-fold algorithm drops faster than that of the parallel Metropolis one (Fig. |^c inset). For sufficiently 
small blocks, the performance of the n-fold routine actually becomes poorer than for the Metropolis routine, as we 
shall see in the experiments with fixed number of PEs and varying block size. For both algorithms the scaling is 
systematically worse than linear, for the reasons explained above ( Fig. || c). In particular, the parallel n-fold way for 
large number of PEs (small block size I) cannot scale better than V A^peT since for fixed L the typical time increment 
scales as Ai par ~ 1/4 ~ I/V-^pe- Although employing many small blocks does result in speedup, the efficiency clearly 
becomes poorer at the same time (Fig. ||d) . 
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For completeness, we mention that the performance of the integer-time, synchronous n-fold way routine becomes 
progressively weaker than that of the asynchronous version with continuous time for increasing number of PEs. For 
example, with 256 PEs it runs 2.3 times slower than the continuous-time routine, due to the necessity of explicit 
barrier synchronizations. It is even slower than the asynchronous Metropolis routine by a factor of 2.1. 

B. Scaling with iVpE for fixed block size. 

Here we keep the size of the blocks fixed (using three different values: Z=64, 128, 256), while increasing the system 
size by using larger numbers of PEs (iVpE=4, 16, 64, 256, and 400). Clearly, the larger the block size, the higher the 
utilization. Even with fixed I, the utilization monotonically decreases with increasing iVpE. However, it appears to 
approach a nonzero value for large numbers of PEs (Fig. j^a) . As was pointed out in Ref. [||, it is a highly non- 
trivial mathematical problem to prove that the utilization tends to a nonzero value as Ape — * oo. For practical 
purposes, given our architecture with 512 nodes this question might seem academic, but it truly lies at the heart of 
this conservative approach to discrete-event simulation. Again, the utilization for the parallel Metropolis algorithm 
significantly exceeds that for the n-fold way: for the Metropolis routine with Z=128 it is 82% with 400 PEs, while for 
the n-fold way it is only 56%, even for 1=256. With fixed I, the mean time increments are independent of Nps in the 
parallel n-fold way routine (Fig. |^b). However, increasing I systematically improves At and thus the performance. We 
find almost linear scaling of the update rate with Npe for both parallel algorithms (Fig. ||c,d). Despite its relatively 
low utilization, the n-fold way routine clearly outperforms the Metropolis one, due to its large time increments (partly 
rejection- free nature). 

Again we note that our implementation of the integer-time synchronous n-fold way algorithm performs rather poorly 
compared to its asynchronous counterpart: it runs 2.3 times slower than the asynchronous n-fold way and 1.3 times 
slower than the asynchronous Metropolis routine with Z=128 and 256 PEs. 

C. Simulation with fixed number of PEs and varying block size. 

Here Ape=64, which can be regarded as a reasonable number for production runs. We increase the block size 
(1=16, 32, 64, 128, 256, 512, and 1024) and "stretch" the routine close to its memory limits. For both algorithms the 
utilization increases with increasing block size. This happens faster for the parallel Metropolis routine, for which 
the utilization is more directly related to the surface-to-volume ratio of the blocks (Fig. ||a): for Z=1024 it is 93% 
while it is 74% for the n-fold way routine. Further, the mean time increment for the parallel n-fold way algorithm 
systematically approaches that of its serial counterpart, Ai scr =19.9 (Fig. [|b and Table |). For large I such that 

1/4 ^> At scr , Atpar ~ At ser , as expected from Eq. (||). The data are in complete agreement with our theoretical 
result, given by the solid curve. Here the temperature and field are moderate, so that the increasing block size I 
allows Atp ar to "catch up" with At sor . The drawback is that employing large blocks can lead to excessive memory 
usage. The PE update rate of the n-fold way routine with /=1024 slightly decreases compared to 1=512 (Fig. ^c), 
even though the utilization and the mean time increment slightly increase. In fact, Z=1024 is fairly close to the largest 
block size we could allocate in the memory of one node. For very small / (Z< 16) , the utilization and At of the n-fold 
way routine are so low that the routine is actually outperformed by the parallel Metropolis one (Fig. ^c). 

D. Simulation with fixed number of PEs and varying physical control parameters. 

Here ./Vpe=64 and Z=128 are kept constant while we vary the magnetic field for three different temperatures to 
study the effects of changing the characteristic length and time scales of the simulated system. As expected, the 
utilization for the Metropolis algorithm is unaffected (Fig. ||a); the slight increase in the PE update rate at lower 
fields (Fig. |B|c) or temperatures is due to the fact that for higher rejection rates fewer variables need to be updated. 
The parallel n-fold way routine suffers from low utilization at low temperatures and fields: in a high percentage of 
the execution time of the main simulation cycle, a spin on the boundary is picked. This not only implies necessary 
communications with its neighboring PE(s), but possible idling if the local times of the neighboring blocks are behind. 
Although the time increments increase with decreasing temperature and field, they cannot exceed 1/4 as shown in 
Eq. @ (Fig. |b). Nevertheless, even with the bounded time increments, the parallel n-fold way routine outperforms 
the parallel Metropolis one (Fig. ^p). The drawback of the parallel (partially rejection-free) n-fold way scheme is the 
low efficiency with respect to its serial counterpart (Fig. ||d); for example at T=0.6T C and \H\/J=0.2222 it is only 



7 



13.2%, corresponding to a speedup of only 8.4. For comparison, the mean time increments of the serial n-fold way 
routine are also given in Table ||. 

V. APPLICATIONS 

As discussed in Section II, below the equilibrium critical temperature the kinetic Ising system exhibits metastable 
decay after an instantaneous magnetic field reversal from \H\ to — \H\. Using standard droplet theory M), one can 
show that a thermally nucleated domain of the stable phase must reach a critical droplet size, corresponding to a 
(temperature and field dependent) critical radius R c , before its growth becomes energetically favorable. For systems 
significantly larger than the typical droplet separation R a (which decreases in a nonlinear fashion with increasing 
| H | and T), many droplets of the stable phase form and grow until they coalesce and occupy the whole system 



(multi-droplet (MD) regime) [fLO 19 1. The parameters we choose correspond to this decay mode. Note that i? c <g;i? , 
in particular, R c /R a — > in the \H\ — > limit. A quantity of interest is the lifetime of the metastable phase, (r), 
which is defined as the average first-passage time to zero magnetization. Exploiting that the system is self-averaging 
in this regime, one may employ the classical Avrami's law for homogeneous systems pp| | in two dimensions to obtain 
an analytical form for the time evolution of the system magnetization 10 1S§] : 

m(t) « ( ?Tl ms 772 s ) 0ms {t)+m S) (11) 

where 

<^ns(*) = e- (ln2)(t/ < T > )3 (12) 

is the volume fraction of the metastable phase, and m ms and m s are the metastable and the stable (equilibrium) 
magnetization, respectively. 

For illustration we choose an L=1024 system at T—0.7T C and \H\/ J=0.2857, and we study metastable decay as 
observed through a 6x6 window. Employing different block sizes, 6, mimics the effect of choosing different finite 
(smaller than the system size) observation windows in a large system. This is clearly relevant to real experiments 
and observations, such as those using the magnetooptical Kerr effect ||l|]. We run our application with Ape=16,64, 
and 256, directly corresponding to block sizes 6=Z=256, 128, and 64, respectively. Since the maximum number of PEs 
available to us is 512, for 6<32 (for the same L=1024 system) we employ 256 PEs with 1—64, and monitor observables 
for a bxb block within the 64x64 subsystem carried by each PE. Even our smallest block, 6=8, is sufficiently large 
that subcritical fluctuations of size R < R c « 2 are not recorded as first-passage times to zero block magnetization. 
Together with the global magnetization, Eq. (0), we monitor the individual block magnetizations 



1 " 2 



Typical time series of these quantities are shown in Fig. |^, where time is measured in units of MCS/spin (MCSS). 
Recording the first-passage times to zero for at least 100 escapes, we calculate the mean and the standard deviation 
of the lifetime (Fig. ^a) and construct histograms for P^ ot (i) , the probability that the system or block magnetization 
has not changed sign by time t (Fig. Qb). 

The average lifetime of a finite subsystem, as observed through a finite window, does not differ significantly from 
the lifetime of the whole system ((r)=158.5 MCSS). However, the statistical properties of the lifetime change not only 
quantitatively, but also qualitatively with b. For b much larger than the typical droplet separation (R ss 32 lattice 
constants at this temperature and field), the time-dependent block magnetization itself is self-averaging, the switching 
process is Gaussian, and consequently P^ ot (t) is an error function flO]] . While the average lifetime within a block is 
unaffected, its standard deviation, erf,, is proportional to 1/6 in two dimensions. Once 6 becomes comparable to or 
smaller than the above picture is no longer valid and a crossover to a qualitatively different behavior is observed. 
For these smaller blocks, the switching within a block is not related to the growth of several droplets nucleated within 
that block, but rather to a single droplet formed within the same block, or to droplets formed elsewhere in the system 
which propagate across the observed block. It is a challenging theoretical problem to describe the switching behavior 
analytically in this crossover regime, and work is in progress to accomplish this task. In the b/R — > limit the 
coarse-grained approximation (in which the size of the critical droplet is negligible) yields P° ot (t) « </> m s(i) [Eq. (p"2|)], 
since the probability that the block magnetization has not switched by time t then becomes equal to the volume 
fraction of the metastable phase. The standard deviation in this limit can be obtained from the probability density of 
the first-passage-time, —dP° ot /dt, yielding a Q « 0.1345(r) « 58.12 MCSS. Note, however, that for 6 smaller than the 
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diameter of the critical droplet (2R C w 4 lattice constants for our parameters) the above argument (which is based on 
the coarse-grained picture) is no longer valid, since zero-crossings of the block magnetization are frequently induced 
by subcritical droplets. 

Another future application in the MD regime is to study the response to a periodic applied magnetic field, which 
exhibits highly nontrivial hysteretic behavior p2| . If the half-period of the applied field is less than the metastable 
lifetime (t), the system almost always does not switch, resulting in a nonzero period-averaged magnetization ("dy- 
namically ordered phase"). On the other hand, when the half-period exceeds (t), the magnetization switches in almost 
every half-period and the period-averaged magnetization is zero ( "dynamically disordered phase" ) . The transition 
between these two phases is sharp and singular: the system exhibits a dynamic phase transition, which fits into the 
general framework of critical phenomena and continuous phase transitions [p3[ . Here there is clearly a need to study 
scaling and universality by obtaining the corresponding critical exponents with high accuracy. Also in this regime, our 
parallel algorithm appears to be very efficient for large systems. We plan to implement it with a periodic square-wave 
shaped applied magnetic field and carry out a large-scale finite-size scaling analysis of the dynamic phase transition. 

VI. CONCLUSION 

We have studied the performance of the parallel n-fold way dynamic Monte Carlo algorithm proposed by 
Lubachevsky in which each PE carries an Ixl block of random variables. The algorithm was implemented 
for a two-dimensional kinetic Ising ferromagnet undergoing metastable decay, but the parallel scheme is generically 
applicable to a wide range of stochastic cellular automata where discrete events (updates) are Poisson arrivals. To 
obtain reasonable performance on the T3E distributed-memory parallel architecture and to be faithful to the original 
dynamics, one must utilize an asynchronous update scheme with continuous time. Then the expensive global barrier 
synchronizations are avoided and spin-flip attempts are modeled as independent Poisson arrivals. We analyzed the 
performance of our implementation, which sensitively depends on the block size and the number of PEs, as well as 
on the characteristic length and time scales of the simulated system. We found that for large enough block size, 
the routine outperforms the standard parallel Metropolis algorithm. For moderately low temperatures it yields high 
speedups with respect to the already fast serial n-fold way algorithm. For example, at T=0.7T C and |if|/J=0.2857, 
employing 1=256 we obtained a speedup of 260 with 400 PEs, and for 1024 a speedup of 58 with 64 PEs. Often the 
system size (possibly large) is specified, and for fixed L, although significantly worse than linear, the speedup is still a 
monotonically increasing function of the number of PEs, up to the maximum 256 PEs that we studied. At the same 
time, the efficiency is monotonically decreasing, which results in larger total CPU time usage to execute the same task 
with a larger number of PEs. If one has unlimited resources (i.e., no allocation limits) this aspect is not relevant. For 
most, like us, who have limited CPU resources on a certain parallel architecture, "optimization" between speedup 
and efficiency can be important. Our implementation is obviously best suited to simulating large systems. 

On the other hand, for very low temperatures, the algorithm does not provide an efficient way to simulate metastable 
decay. The reason for the relatively narrow regime of efficient implementation lies in the introduction of a special 
class in the n-fold way algorithm which "shields" the blocks from each other, but significantly decreases the typical 
time increments. The algorithm avoids rollbacks, but pays a large price: it looses the arbitrarily large time increments 
that is the most important feature of the serial n-fold way algorithm, at arbitrarily low temperature and field. To 
obtain reasonable efficiency compared to the serial n-fold way algorithm, one needs to employ large blocks such that 
1/4 At S er, and clearly it is impossible to keep up with very large serial time increments by increasing I. 

One way to preserve the advantage of the original n-fold way algorithm in principle would be to apply it directly 
on each block (optimistic approach). This would require a complex protocol to correct erroneous computations. Such 
a rollback procedure would ensure the correct time ordering of simulated events. This mechanism is not unknown in 
distributed event simulation p4[ and certainly has potential in our model. The complexity of such an implementation, 
however, might carry a tremendous overhead with respect to the very simple and fast serial algorithm for the Ising 
model. Another possible way to improve efficiency, while avoiding a general rollback procedure, is to consider relaxation 
p5f which might use local speculative computations before scheduling an event f2q| . 
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TABLE I. Mean time increments (in MCSP) for the serial and the parallel n-fold way algorithms with different block size 
I at T=0.7T C , |ff |/J=0.2857. They are approximately independent of the full system size L and Npe- (*) The mean time 
increment for the serial algorithm is approximately independent of L. 
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TABLE II. Mean time increments (in MCSP) for the serial and the parallel n-fold way algorithms for different temperatures 
and magnetic fields (Npe— 64, 1=128). 
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FIG. 1. Schematic diagram of the spatial decomposition of the system and its mapping onto a parallel machine. Here L=12 
and 1=4. Each of the Npe=(L/1) 2 =9 processing elements (PEs) carries l 2 =16 spins, confined by solid lines. The spins on the 
boundary are separated from those in the kernel by dashed lines. 
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FIG. 2. Scaling and performance analysis for fixed system size, L=512, at T=0.7T C and \H\/ J=0.2857, comparing the parallel 
ra-fold way algorithm (open squares) and the parallel Metropolis algorithm (filled squares) . The lines connecting the data points 
are merely guides to the eye except in (b), where they represent the theoretical prediction of Eq. (^|). (a) Utilization, (b) Mean 
time increment, At, for the parallel n-fold way algorithm, (c) Update rate (inset: PE update rate in units of 10 6 MCSP/s). 
(d) Speedup (inset: efficiency). 
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FIG. 3. Scaling and performance analysis for three different block sizes, Z=64, 128, and 256 for the parallel n-fold way 
algorithm (open circles, squares, and triangles, respectively), and Z=128 for the parallel Metropolis algorithm (filled squares) 
at T—0.7T C and |iZ|/J=0.2857. The linear system size is L=l^/ Npe- The lines connecting the data points are merely guides to 
the eye except in (b), where they represent the theoretical prediction of Eq. ([]). (a) Utilization, (b) Mean time increment, At, 
for the parallel n-fold way algorithm, (c) Update rate (inset: PE update rate in units of 10 6 MCSP/s). (d) Speedup (inset: 
efficiency) . 
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FIG. 4. Performance analysis for fixed number of PEs, iVpE=64, as a function of the block size, I, for the parallel n-fold way 
algorithm (open squares), and for the parallel Metropolis algorithm (filled squares) at T=0.7T C and \H\/J=0.2857. The lines 
connecting the data points are merely guides to the eye except in (b), where they represent the theoretical prediction of Eq. 
([]). (a) Utilization, (b) Mean time increment, At, for the parallel n-fold way algorithm, (c) Update rate (right scale) and PE 
update rate (left scale), (d) Speedup (right scale) and efficiency (left scale). 
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FIG. 5. Performance analysis for three different temperatures, T—0.8T C , 0.7T C , and 0.6T C for the parallel n-fold way algorithm 
(open circles, squares, and triangles, respectively), and T=0.7T C for the parallel Metropolis algorithm (filled squares) as functions 
of the magnetic field. We employ iVpE=64 and / = 128 (L=1024). The lines connecting the data points are merely guides to the 
eye. (a) Utilization, (b) Mean time increment, At, for the parallel n-fold way algorithm. Filled circles indicate the theoretical 
predictions of Eq. ([]). (c) Update rate (right scale) and PE update rate (left scale), (d) Speedup (right scale) and efficiency 
(left scale). 
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FIG. 6. Global and block magnetization time series for 10 different realizations at T=0.7T C and \H\/ J=0.2857. (a) For 
6=L=1024 (global). The thick solid line, which is difficult to distinguish from the simulation results, represents Avrami's law 
[Eqs. ([lj) and @]. (b) For 6=256. (c) For 6=128. (d) For 6=64. (e) For 6=32. (f) For 6=16. 



1G 




\/b t[MCSS] 



FIG. 7. (a) Standard deviation of the first-passage time of the block magnetization to zero (open triangles) as a function 
of the inverse block size, 1/6. The dashed line is merely a guide to the eye. The solid straight line represents the theoretical 
dependence on 6 of the standard deviation for large 6. The dotted horizontal line is the b/R — > limit of the standard 
deviation (a = 58.12 MCSS) within the coarse-grained approximation, (b) The probability that the block magnetization, 
observed through a 6x6 window, has not changed sign by time t after a sudden magnetic field reversal in a 1024x1024 Ising 
system. The solid curves for 6=1024, 256, and 128 are scaled error functions, given by theory jjl^] for large 6. The dotted curve 
for 6=8 represents the theoretical coarse-grained b/R —> limit, i.e., the metastable volume fraction, (f> ms (t) [Eq. (|l2])], given 
by Avrami's law. As expected, it fits the data very well for t<(r), where droplet coalescence is unimportant. 
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